Abstract. Understanding the complexity of human language requires an appropriate analysis of the statistical
distribution of words in texts. We consider the information retrieval problem of detecting and ranking
the relevant words of a text by means of statistical information referring to the spatial use of the words.
Shannon's entropy of information is used as a tool for automatic keyword extraction. By using The Origin
of Species by Charles Darwin as a representative text sample, we show the performance of our detector and
compare it with other proposals in the literature. The randomly shuffled text receives special attention as
a tool for calibrating the ranking indices.

PACS 89.70.+c Information theory and communication theory

1 Introduction

Data mining for texts is a well-established area of natural
language processing [1]. Text mining is the computerised
extraction of useful answers from a mass of textual information
by machine methods, computer-assisted human
ones, or a combination of both. A key problem in text
mining is the extraction of keywords from texts for which 
no a priori information is available. The problem of unsu- 
pervised extraction of relevant words from their statistical 
properties was first addressed by Luhn [2], who based his 
method on Zipf's analysis of frequencies [3]. This analy- 
sis consists of counting the number of occurrences of each 
distinct word in a given text, and then generating a list 
of all these words ordered by decreasing frequency. In this 
list, each word is identified by its position or Zipf's rank 
in the list. The empirical observation of Zipf was that the 
frequency of occurrence of the word of rank $r$ in the list is
proportional to $r^{-1}$ (Zipf's law). Luhn proposed the crude
approach of excluding the words at both ends of Zipf's
list and considering the remaining ones as keywords. The
limitations of Luhn's approach are known in the litera- 
ture [4]. 

The main goal of this work is to investigate unsupervised
statistical methods for detecting keywords in literary
texts beyond the simple counting of word occurrences. In
order to obtain statistically significant results we restrict
our work to a large book, which can be used as a corpus
that is thematically consistent throughout its entire
length. We are searching for relevance according to the
text's context, but we will only use statistical information
about the spatial use of the words in a text.

In particular, the information content of
each word can be measured by Shannon's entropy. In the
physics literature, we can find several applications of the
entropy concept to linguistics and natural language, such as
DNA sequence analysis [5-7], long-range correlation
measurements [8,9], language acquisition [10], authorship
disputes [11,12], communication modelling [13], and statistical
analysis of the linguistic role of words in corpora [14].

The remainder of the article is organised as
follows. In Sec. 2 we first introduce the corpus used as
a representative sample throughout this work. Later, in
Sec. 3 we review the algorithms proposed in the literature
based on the analysis of the statistical distribution of
words in a text. Then, in Sec. 4 we discuss the behaviour of
the indices in random texts. By using Shannon's entropy,
in Sec. 5 we propose another index based on the information
content of the sequence of occurrences of each word
in the text. In Sec. 6 we use the glossary of the corpus for
measuring the performance of each index as a keyword detector.
Finally, in Sec. 7 we present a summary of the work.
Mathematical details are given in the appendices. In
Appendix A we review the geometrical distribution, which is
useful for random texts, and in Appendix B we calculate the
entropy of a random text.
2 Representative Corpus Sample

For our study, we will use a prototypical real text, namely
"On the Origin of Species by Means of Natural Selection,
or The Preservation of Favoured Races in the Struggle for
Life" [15] (usually abbreviated to The Origin of Species)
by Charles Darwin (1859). The book was written with the
vocabulary of a nineteenth-century naturalist but with 
fluid prose, combining first-person narrative with schol- 
arly analysis. 

For the preparation of our working corpus we first
removed all punctuation symbols from the text, mapped
all words to uppercase and then used the simple tokenization
method based on whitespaces [1]. We draw a distinction
between a word token and a word type. For our
convenience, we define a word type as any different string 
of letters between two whitespaces. Thus, for our elemen- 
tary analysis, words like INSTINCT and INSTINCTS corre- 
spond to different word types in our corpus. On the other 
hand, a word token is each individual occurrence of a given 
word type. When the context refers to a particular word
type, we will use "word token" or simply "token"
interchangeably to refer to an individual occurrence of the word type
in the text.
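
The preparation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `tokenize` is hypothetical, and `string.punctuation` covers ASCII punctuation only, so a real edition of the book may need a larger symbol set.

```python
import string
from collections import Counter


def tokenize(text):
    """Remove punctuation, map to uppercase and split on whitespace."""
    # Assumption: ASCII punctuation is enough for this illustration.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).upper().split()


sample = "Instinct, and instincts: the instinct of the cuckoo."
tokens = tokenize(sample)   # every individual occurrence is a word token
types = Counter(tokens)     # every distinct string is a word type

print(len(tokens), "tokens,", len(types), "types")
print(types["INSTINCT"], types["INSTINCTS"])  # counted as different types
```

As in the text, INSTINCT and INSTINCTS are kept as separate word types because no stemming is applied.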

The relevant words have not been explicitly defined in
Darwin's book, with the exception of a glossary appended at
the end of the work. Therefore, the table of contents at the
beginning, the glossary and the analytical index, also inserted
at the end, were removed from our corpus. By doing
this, we avoid introducing an obvious bias for the words used
in these parts. Thus, the prepared corpus includes 94% of
the material from Darwin's original book and has 192,665
word tokens and 8,294 word types. In addition, the corpus
contains 842 paragraphs distributed in 16 chapters.

The glossary of the principal scientific terms used in
the book, prepared by Mr. W.S. Dallas, and the analytical
index, both appended at the end of the book, were
written using 2,418 word types. If we do not consider the
function words, 1,679 word types still remain (20% of the
book's lexicon). With this information, we prepared by
hand a customised version of the glossary, by selecting 283
word types (3.4% of the lexicon) with frequencies of occurrence
greater than 9. We have avoided less frequent word types
because we cannot extract any significant
statistics from data obtained using such small sets.
The criterion for selection was therefore somewhat arbitrary,
but we think that all selected words are pertinent to the
book's context. Our prepared version of the glossary will
be used later to evaluate the retrieval capabilities of different
keyword extractors.



3 Clustering as criterion for relevance of 
words 

The attraction between words is a phenomenon that plays
an important role in both language processing and acquisition,
and it has been modeled for information retrieval
and speech recognition purposes [16,17]. Empirical data
reveal that the attraction between words decays exponentially,
while stylistic and syntactic constraints create
a repulsion between words that discourages close occurrences.
In Fig. 1 we have plotted the histogram of absolute
frequencies of distances between nearest neighbour
tokens of the word type NATURAL in Darwin's corpus.
For long distances, Fig. 1 qualitatively suggests an exponential
tail, but for very short distances the frequencies
decay abruptly. Also in Fig. 1 we have superimposed
the histogram of a randomly shuffled version of the corpus,
where we can qualitatively see an exponential decay for all
distances. The attraction-repulsion phenomenon is more
pronounced for relevant words than for common words,
which have fewer syntactic penalties for close co-occurrence.
Therefore, the spatial distributions of relevant words in
the text are inhomogeneous and these words gather together
in some portions of the text forming clusters. The
clustering phenomenon can be visualised in Fig. 2, where
we have plotted the absolute positions of four different
word types from Darwin's corpus in a "bar code" arrangement.
The clustering becomes manifest in the patterns of
NATURAL, LIFE, and INSTINCT in spite of their different
numbers of occurrences. In contrast, THE (the most
frequent word in the English language) has no apparent
clustering.

Fig. 1. Histogram of frequencies of distances between occurrences
of NATURAL (Zipf's rank 53) in Darwin's corpus.

Recently, the assumption that highly relevant words
should be concentrated in some portions of the text was
used for searching relevant words in a given text. In the
following two subsections, we briefly review the indices of
relevance of words proposed by Ortuño et al. [18] and Zhou
and Slater [19], which are based on the spatial distribution
of words in the text.
Fig. 2. Absolute positions (t) in the text, counted from the
beginning of Darwin's corpus, of the word types: THE (13,414
occurrences), NATURAL (475 occurrences), LIFE (326 occurrences),
and INSTINCT (64 occurrences). To draw the picture,
we set a very thin vertical line (of arbitrary height) at the
position of each occurrence.



3.1 $\sigma$-index

To study the spatial distribution of a given word type in
a text, we can map the occurrences of the corresponding
word tokens into a time series. For this task, we denote
by $t_i$ the absolute position in the corpus of the $i$-th
occurrence of a word token. Thus, we obtain the sequence
$\{t_0, t_1, \ldots, t_n, t_{n+1}\}$, where we are assuming that there are
$n$ word tokens. We have additionally included the boundaries
of the corpus, defining $t_0 = 0$ and $t_{n+1} = N + 1$,
where $N$ is the total number of tokens in the corpus, in
order to take into account the space before the first
occurrence of a word token and the space after the last
occurrence of a token [19].

Given the sequence of inter-token distances

$\{t_1 - t_0, t_2 - t_1, \ldots, t_n - t_{n-1}, t_{n+1} - t_n\}$,

the average distance between two successive word tokens
is given by

$$ \mu = \frac{1}{n+1} \sum_{i=0}^{n} (t_{i+1} - t_i) = \frac{N+1}{n+1} , \qquad (1) $$



and the sample standard deviation of the set of spacings
between nearest neighbour word tokens $(t_{i+1} - t_i)$ is given by

$$ s = \sqrt{ \frac{1}{n} \sum_{i=0}^{n} \left( (t_{i+1} - t_i) - \mu \right)^2 } . \qquad (2) $$



To eliminate the dependence on the frequency of
occurrence for different word types, in Ref. [18] the authors
suggest normalising the token spacings, i.e., measuring
them in units of their corresponding mean value. Thus, we
define

$$ \sigma = \frac{s}{\mu} . \qquad (3) $$

Given that the standard deviation grows rapidly when
the inhomogeneity of the distribution of spacings $t_{i+1} - t_i$
increases, Ortuño et al. [18] proposed $\sigma$ as an indicator of
the relevance of the words in the analysed text. In many
cases, empirical evidence indicates that large $\sigma$ values
generally correspond to terms relevant to the text considered,
and that common words have associated low values
of $\sigma$. However, Zhou and Slater [19] pointed out that the
$\sigma$-index has some weaknesses. First, several obviously common
(relevant) words have relatively high (low) $\sigma$ values in
several texts. Second, the index is not stable in the sense
that it can be strongly affected by the change of a single
occurrence position. Third, high values of $\sigma$ do not always
imply a cluster concentration. A big cluster of words
can be split into smaller clusters without substantial
change in the $\sigma$ value.
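
As an illustration of Eqs. (1)-(3), the following Python sketch computes the $\sigma$-index from the positions of a word type. The function names are hypothetical, and the normalisation of the sample variance (division by $n$ for the $n+1$ spacings) is an assumption of this sketch.

```python
import math


def spacings(positions, n_total):
    """Inter-token distances with the boundaries t_0 = 0 and t_{n+1} = N + 1."""
    t = [0] + sorted(positions) + [n_total + 1]
    return [b - a for a, b in zip(t, t[1:])]


def sigma_index(positions, n_total):
    """sigma = s / mu for one word type (needs at least one occurrence)."""
    d = spacings(positions, n_total)
    mu = sum(d) / len(d)  # equals (N + 1) / (n + 1)
    # Assumption: sample variance with divisor (number of spacings - 1).
    var = sum((x - mu) ** 2 for x in d) / (len(d) - 1)
    return math.sqrt(var) / mu


# Toy corpus of N = 19 tokens with three occurrences of a word.
even = sigma_index([5, 10, 15], 19)       # evenly spread occurrences
clustered = sigma_index([9, 10, 11], 19)  # same count, one tight cluster
print(even, clustered)
```

The evenly spread word gives $\sigma = 0$, while the clustered one gives a much larger value, which is the behaviour the index relies on.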



3.2 $\Gamma$-index

The $\sigma$-index is only based on the spacing between nearest-neighbour
word tokens. To improve the performance in the
search for relevance, Zhou and Slater [19] introduced a
new index that uses more information from the sequence
of occurrences $\{t_0, t_1, \ldots, t_n, t_{n+1}\}$. For this task, these
authors consider the spacings $w_i = t_i - t_{i-1}$, with
$i = 1, \ldots, n+1$, and define the average separation around the
occurrence at $t_i$ as

$$ d(t_i) = \frac{w_{i+1} + w_i}{2} = \frac{t_{i+1} - t_{i-1}}{2} . \qquad (4) $$

The position $t_i$ is said to be a cluster point if $d(t_i) < \mu$.
The new suggestion is that the relevance of a word in a
given text is related to the number of cluster points found
in it. Thus, in order to measure the degree of clusterization,
the local cluster index at position $t_i$ is defined by

$$ \gamma(t_i) = \begin{cases} \dfrac{\mu - d(t_i)}{\mu} & \text{if } t_i \text{ is a cluster point} \\ 0 & \text{otherwise} \end{cases} \qquad (5) $$

Finally, a new index to measure relevance is obtained from
the average of all cluster indices corresponding to a given
word type

$$ \Gamma = \frac{1}{n} \sum_{i=1}^{n} \gamma(t_i) . \qquad (6) $$

The $\Gamma$-index is more stable than $\sigma$, but it is still based on local
information and is computationally more time consuming
to evaluate than $\sigma$.
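
A minimal Python sketch of the cluster index follows, assuming the reading of Eqs. (4)-(6) in which $d(t_i) = (t_{i+1} - t_{i-1})/2$ and the local index is $(\mu - d(t_i))/\mu$ at cluster points. The function name `gamma_index` is hypothetical.

```python
def gamma_index(positions, n_total):
    """Average local cluster index over the n occurrences of one word type."""
    t = [0] + sorted(positions) + [n_total + 1]  # boundaries t_0 and t_{n+1}
    n = len(positions)
    mu = (n_total + 1) / (n + 1)                 # mean spacing
    total = 0.0
    for i in range(1, n + 1):
        d = (t[i + 1] - t[i - 1]) / 2            # average separation around t_i
        if d < mu:                               # t_i is a cluster point
            total += (mu - d) / mu               # local cluster index
    return total / n


# Same toy corpus as before: N = 19 tokens, three occurrences.
print(gamma_index([5, 10, 15], 19))   # evenly spread: no cluster points
print(gamma_index([9, 10, 11], 19))   # only the middle occurrence is a cluster point
```

In the clustered example only the occurrence at position 10 satisfies $d(t_i) < \mu$, which shows how local the index is: the two outer occurrences of the cluster contribute nothing.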






Juan P. Herrera, Pedro A. Pury: Statistical Keyword Detection in Literary Corpora 



4 Random text and shuffled words 

In a completely random text we have an uncorrelated sequence
of tokens, and a word type $w$ is only characterised
by its relative frequency of occurrence ($p_w$). Thus, a random
text can be generated by successively picking tokens
by chance in such a way that at each position the probability
of finding a token corresponding to the word type
$w$ is $p_w$. Obviously, $\sum_w p_w = 1$. For the word type $w$, we
have in this manner defined a binomial experiment where
the probability of success (occurrence) at each site in the
text is $p_w$, and the probability of failure (non-occurrence)
is $(1 - p_w)$. Therefore, the distribution of distances between
nearest neighbour tokens corresponding to the same
word type is geometrical. In Appendix A we have compiled
some results of the geometrical distribution that are
useful for our next analyses.

Besides its worth as a comparative standard, the theoretical
random text has the virtue of being analytically
tractable. Also, from an empirical point of view, there is
a workable way of building a random version of a corpus.
In an actual corpus the probabilities of occurrence
$p$ are estimated from the relative frequencies $n/N$, where
$n$ is the number of tokens corresponding to a given word
type and $N$ is the total number of tokens in the corpus.
A random version of the text can be obtained by shuffling
or permuting all the tokens. The random shuffling of all
the words has the effect of recasting the corpus into a
nonsensical realization, keeping the same original tokens
without discernible order at any level. However, both
Zipf's list of ranks and the frequency of occurrence of each
word type are kept intact.

The important point that we want to stress here is that
the indices of relevance defined in the previous section are
functions of the frequencies of occurrence of each word
type. Thus, in a random text the values of these indices
change with $p$, which makes no sense: in a truly random
text, there are no relevant words. Therefore, to eliminate
completely the dependence on frequency we need to renormalise
the indices with their values in the random version
of the corpus.
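
The shuffling procedure can be sketched as follows in Python. The toy corpus and the helper names `shuffled_corpus` and `gaps` are assumptions made only for this illustration.

```python
import random
from collections import Counter


def shuffled_corpus(tokens, seed=0):
    """Return a random permutation of the tokens (the original list is untouched)."""
    rng = random.Random(seed)
    out = list(tokens)
    rng.shuffle(out)
    return out


def gaps(tokens, word):
    """Distances between nearest-neighbour occurrences of `word`."""
    pos = [i for i, tok in enumerate(tokens, start=1) if tok == word]
    return [b - a for a, b in zip(pos, pos[1:])]


# Toy corpus: the word A is fully clustered at the beginning (p = 0.05).
corpus = ["A"] * 50 + ["B"] * 950
shuffled = shuffled_corpus(corpus)

# Frequencies, and therefore Zipf's ranks, are unchanged by the shuffling.
print(Counter(shuffled) == Counter(corpus))
# The clustering is destroyed: the gaps are no longer all equal to 1.
print(max(gaps(corpus, "A")), max(gaps(shuffled, "A")))
```

In the shuffled version the gaps approximately follow the geometrical distribution with mean close to $1/p$, which is what the renormalisation of the indices exploits.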



4.1 Renormalised $\sigma$-index

For a given probability distribution, $\sigma$ is defined from the
second-order ($\mu_2$) and first-order ($\mu_1$) cumulants by $\sigma = \sqrt{\mu_2}/\mu_1$.
Thus, from Eq. (20) in Appendix A we find that in a
random text the value of the $\sigma$-index is given by

$$ \sigma_{\rm ran} = \sqrt{1 - p} . \qquad (7) $$

Hence, we renormalise the index to eliminate this dependence
on relative frequency by defining

$$ \sigma_{\rm nor} = \frac{\sigma}{\sigma_{\rm ran}} = \frac{s}{\mu \sqrt{1 - p}} . \qquad (8) $$

For texts as large as corpora the importance of the normalisation
factor given by Eq. (7) becomes negligible. For example,
in Darwin's corpus, $N = 192{,}665$, and for the most
frequent word type (THE) we have $n = 13{,}414$ ($n/N = 0.0696$).
Thus, in the most extreme case (the lowest value
of $\sigma_{\rm ran}$) we have $\sigma_{\rm ran} = 0.965$, whereas $\sigma_{\rm ran} = 1$ for $p = 0$.
However, for shorter texts the significance of the normalisation
may become critical and the values of $\sigma$ and $\sigma_{\rm nor}$ may be
very different for any word type.

Fig. 3. Renormalised $\sigma$-index vs. Zipf's rank for each word
in Darwin's corpus (the first 4000 ranks). We have also superimposed
the random version of the text (grey) and we
have marked by open circles the words corresponding to our
prepared glossary (red online).

In Fig. 3 we plot the values of $\sigma_{\rm nor}$ for the first 4000
ranks in the Zipf's list of Darwin's corpus. The random
version of the corpus is also plotted in the same graph.
The "cloud of points" corresponding to the random text
is distributed around the unitary value of $\sigma_{\rm nor}$, but the
width of the "cloud" grows with rank. This behaviour is
due to the fact that the frequency of occurrence decreases
as the rank increases (Zipf's law), and therefore the statistics
get worse. The words of our prepared version of the
glossary are marked by open circles in Fig. 3. From Fig. 3
it can be seen that most of the glossary words have high
values of $\sigma_{\rm nor}$.
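
The size of the normalisation factor quoted above can be checked with a short Python sketch. It assumes that Eq. (7) reads $\sigma_{\rm ran} = \sqrt{1-p}$, the value of $\sqrt{\mu_2}/\mu_1$ for the geometrical distribution; the function names are hypothetical.

```python
import math


def sigma_ran(p):
    """sigma-index of a word with relative frequency p in a random text."""
    return math.sqrt(1.0 - p)


def sigma_nor(sigma, n, n_total):
    """Renormalised index: the measured sigma in units of its random-text value."""
    return sigma / sigma_ran(n / n_total)


# Figures quoted in the text for the word type THE in Darwin's corpus.
p_the = 13414 / 192665
print(round(p_the, 4), round(sigma_ran(p_the), 3))
```

Even for the most frequent word the correction is below 4%, so for a corpus of this size the rankings by $\sigma$ and $\sigma_{\rm nor}$ can hardly differ.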



4.2 Renormalised skewness 

As in the case of $\sigma$, any cumulant contains partial information
about the spatial distribution of words. Skewness is
a parameter that describes the asymmetry of a distribution.
Mathematically, the skewness is measured using the
second-order ($\mu_2$) and third-order ($\mu_3$) cumulants of the
distribution according to $\kappa = \mu_3 / \mu_2^{3/2}$. Given that the distances
between nearest neighbour tokens are positive, the
corresponding distribution has positive skew, i.e., the upper
tail is longer than the lower tail (see Fig. 1).
From Eq. (20), we find that in a random text the skewness
of the distribution of distances between nearest neighbour
tokens is given by

$$ \kappa_{\rm ran} = \frac{2 - p}{\sqrt{1 - p}} . \qquad (9) $$

Thus, the skewness also depends on the relative frequency
of occurrence, $p$, in the random case. However, this dependence
is also negligible for corpora. In Darwin's corpus
we obtain $\kappa_{\rm ran} = 2.001$ for the largest value $p = 0.0696$
(the relative frequency of the word type THE), whereas
$\kappa_{\rm ran} = 2$ for $p = 0$.

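As a consequence, we can define another renormalised
quantity as we did with the $\sigma$-index. Thus, to eliminate
the dependence on the relative frequency of occurrence in
the random case, we write

$$ \kappa_{\rm nor} = \frac{\kappa}{\kappa_{\rm ran}} = \frac{\mu_3}{\mu_2^{3/2}} \, \frac{\sqrt{1 - p}}{2 - p} . \qquad (10) $$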
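
A hedged Python sketch of the renormalised skewness follows. It assumes that $\kappa_{\rm ran}$ is the skewness of the geometrical distribution, $(2-p)/\sqrt{1-p}$, and it estimates the cumulants with plain (biased) sample moments; both choices, and the function names, are assumptions of this sketch.

```python
import math


def skewness(d):
    """Sample skewness mu_3 / mu_2^(3/2) of a list of spacings (biased moments)."""
    m = sum(d) / len(d)
    mu2 = sum((x - m) ** 2 for x in d) / len(d)
    mu3 = sum((x - m) ** 3 for x in d) / len(d)
    return mu3 / mu2 ** 1.5


def kappa_ran(p):
    """Skewness of the spacings of a word with relative frequency p in a random text."""
    return (2.0 - p) / math.sqrt(1.0 - p)


def kappa_nor(d, p):
    """Renormalised skewness of the spacings d."""
    return skewness(d) / kappa_ran(p)


print(round(kappa_ran(0.0696), 3), kappa_ran(0.0))
print(round(skewness([1, 1, 1, 9]), 4))  # one long gap gives a long upper tail
```

The first line reproduces the values quoted for THE and for the limit $p = 0$; the second shows the positive skew produced by a single long spacing.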



$\kappa_{\rm nor}$ can also be used for measuring relevance. However,
the finite-size effects of the texts are more pronounced
for higher-order cumulants. We now use both
$\sigma_{\rm nor}$ and $\kappa_{\rm nor}$ to construct a two-dimensional graph for
the corpus. In this manner, in Fig. 4 we plot the
pairs $(\sigma_{\rm nor}, \kappa_{\rm nor})$ for all words in Darwin's corpus. In this
graph, the "cloud of points" corresponding to the random
text is distributed around the pair of values (1,1), while
the region defined by $\sigma_{\rm nor} > 2$ and $\kappa_{\rm nor} > 2$ has almost
none. The upper right corner of the graph concentrates almost
all the points corresponding to the glossary. Figure 4
gives us immediate insight into the distribution of distances
between nearest neighbour tokens, and provides us with
a powerful tool for determining keywords.

Fig. 4. Renormalised $\kappa$-index vs. $\sigma$-index for all words in
Darwin's corpus. We have also superimposed the random
version of the text (grey) and we have marked by open circles
the words corresponding to our prepared glossary (red online).

4.3 Renormalised $\Gamma$-index

As we did with the $\sigma$-index, we need to calculate $\Gamma$ for a
word type which appears in a random text with relative
frequency $p$. For this task, we calculate the average of the
random variable $\gamma$ defined in Eq. (5) in a random text.
From Eq. (30) in Appendix A we obtain

$$ \Gamma_{\rm ran} = \frac{1}{2} h (h - 1) (1 - p)^h \left( (1 - p) + (1 - p)^{-1} - 2 \right) , \qquad (11) $$

where $h = {\rm Int}[2/p]$. In this case, the dependence on $p$ is
even more complicated than in the previous cases. This observation
is absent from Ref. [19]. Zhou and Slater only calculate
the value of $\Gamma$ for the Poisson distribution: $\Gamma = 2 e^{-2}$
(see Eq. (33) in Appendix A), which is constant ($\approx 0.271$).
Also in this case, the dependence on $p$ is negligible for corpora.
In Darwin's corpus we obtain $\Gamma_{\rm ran} = 0.261$ for the
largest value of $p = 0.0696$ (the relative frequency of the
word type THE), whereas $\Gamma_{\rm ran} \simeq 0.271$ in the limit $p \to 0$
(see Appendix A).

Now, as in the other cases, we define from Eqs. (6)
and (11) a renormalised index by $\Gamma_{\rm nor} = \Gamma / \Gamma_{\rm ran}$. In
Fig. 5 we plot the values of $\Gamma_{\rm nor}$ for the first 4000 ranks in
the Zipf's list of Darwin's corpus. The "cloud of points"
corresponding to the random text is distributed around
the unitary value, but the width of the "cloud" grows
with rank faster than in the case of $\sigma_{\rm nor}$. The words
corresponding to the glossary have systematically high values
of $\Gamma_{\rm nor}$.
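
The random-text value of the cluster index can be checked numerically. The sketch below assumes that Eq. (11) reads $\Gamma_{\rm ran} = \frac{1}{2} h (h-1) (1-p)^h [(1-p) + (1-p)^{-1} - 2]$ with $h = {\rm Int}[2/p]$, and compares it with a brute-force average of the local cluster index over the sum of two independent geometrical spacings. Both function names are hypothetical.

```python
import math


def gamma_ran_closed(p):
    """Closed form assumed for Eq. (11), with h = Int[2/p]."""
    h = int(2.0 / p)
    q = 1.0 - p
    return 0.5 * h * (h - 1) * q ** h * (q + 1.0 / q - 2.0)


def gamma_ran_sum(p):
    """Brute-force average of (mu - d)/mu over cluster points in a random text.

    The sum s of two consecutive geometrical spacings takes the value s
    with probability (s - 1) p^2 (1 - p)^(s - 2); mu = 1/p and d = s/2.
    """
    q = 1.0 - p
    total = 0.0
    for s in range(2, int(2.0 / p) + 1):
        local = 1.0 - p * s / 2.0          # (mu - d) / mu
        if local > 0.0:                    # cluster point: d < mu
            total += local * (s - 1) * p * p * q ** (s - 2)
    return total


print(round(gamma_ran_closed(0.0696), 3))    # value quoted for THE
print(round(2.0 * math.exp(-2.0), 3))        # Poisson limit
print(gamma_ran_closed(0.3), gamma_ran_sum(0.3))
```

Under these assumptions the closed form and the direct sum agree, and the closed form approaches $2e^{-2}$ as $p$ decreases.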



5 Entropy of token distributions 

Claude Shannon introduced the concept of entropy of information
in 1948 [20]. Mapping a discrete information
source on a set of possible events whose probabilities of
occurrence are $p_1, p_2, \ldots, p_P$, Shannon constructed a measure
of information and uncertainty, $S(p_1, p_2, \ldots, p_P)$,
requiring the following properties:

1. $S$ should be continuous in the $\{p_i\}$.

2. For the iso-probability case, $p_i = 1/P$, $S$ should be a
monotonic increasing function of $P$.

3. If the set $p_1, p_2, \ldots, p_P$ is broken down into two subsets
with probabilities $w_1 = p_1 + \ldots + p_k$ and $w_2 = p_{k+1} +
\ldots + p_P$, then we must have the following composition
law $S(p_1, \ldots, p_P) = S(w_1, w_2) + w_1 S(p_1/w_1, \ldots, p_k/w_1)
+ w_2 S(p_{k+1}/w_2, \ldots, p_P/w_2)$.

The only $S$ satisfying the three above assumptions is of
the form

$$ S(p_1, p_2, \ldots, p_P) = -K \sum_{i=1}^{P} p_i \log p_i , \qquad (12) $$

where $K$ is a positive constant.

Fig. 5. Renormalised $\Gamma$-index vs. Zipf's rank for each word
in Darwin's corpus (the first 4000 ranks). We have also superimposed
the random version of the text (grey) and we
have marked by open circles the words corresponding to our
prepared glossary (red online).

A literary corpus can be divided into parts using natural
partitions such as parts, sections, chapters, paragraphs or
sentences. Thus, we consider the corpus as a composite of
$P$ parts. For the $i$-th part of the corpus we can count
the total number $N_i$ of tokens and the number $n_i(w)$ of
occurrences of the word type $w$ inside this part. Then, the
fraction $f_i(w) = n_i(w)/N_i$ ($i = 1, \ldots, P$) is the relative
frequency of occurrence of the word type $w$ in the part
$i$. Obviously, $\sum_{i=1}^{P} N_i = N$ is the total number of tokens
in the corpus and $\sum_{i=1}^{P} n_i(w) = n(w)$ is the number of
tokens corresponding to the word type $w$. Therefore, it
is possible to define a probability measure over the partitions
[14] as

$$ p_i(w) = \frac{f_i(w)}{\sum_{j=1}^{P} f_j(w)} . \qquad (13) $$

The quantity $p_i(w)$ is more involved than the conditional
probability $f_i(w)/(n(w)/N)$ of finding the word
type $w$ in the part $i$ given that it is present in the corpus.

Following Shannon's arguments, the information entropy
associated with the discrete distribution $p_i(w)$ is

$$ S(w) = -\frac{1}{\ln P} \sum_{i=1}^{P} p_i(w) \ln p_i(w) . \qquad (14) $$

The value $1/\ln(P)$ for the constant $K$ was selected so that
the maximum value of $S$ is equal to one. Thus,
$0 \le S(w) \le 1$. In this manner, when a word type is uniformly
distributed ($p_i = 1/P$, for all $i$), Eq. (14) yields $S = 1$.
Conversely, the other extreme case, $S = 0$, occurs when a word
type appears only in part $j$, so that $p_j = 1$ and $p_i = 0$
for $i \ne j$. Therefore, words with frequent grammatical use
like function words (prepositions, adverbs, adjectives, conjunctions,
and pronouns) will have high values of entropy,
while keywords will have low values of entropy. Empirical
evidence [14] shows a tendency of the entropy to
increase with $n$. It implies that, on average, the more frequent
word types are more uniformly used.
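
The entropy of a word type over a partition can be sketched in Python as follows. The sketch assumes that the probability measure is $p_i(w) = f_i(w)/\sum_j f_j(w)$ and that the corpus has at least two parts; the function name `word_entropy` is hypothetical.

```python
import math


def word_entropy(counts, part_sizes):
    """Normalised entropy S(w) in [0, 1] of one word type.

    counts[i] is n_i(w), the occurrences of w in part i;
    part_sizes[i] is N_i, the number of tokens in part i.
    Requires at least two parts and at least one occurrence.
    """
    f = [n / size for n, size in zip(counts, part_sizes)]
    total = sum(f)
    p = [x / total for x in f]
    # Terms with p_i = 0 contribute nothing to the sum.
    h = -sum(x * math.log(x) for x in p if x > 0.0)
    return h / math.log(len(part_sizes))


sizes = [100, 100, 100, 100]
print(word_entropy([5, 5, 5, 5], sizes))   # uniformly used word
print(word_entropy([7, 0, 0, 0], sizes))   # word confined to one part
print(word_entropy([6, 1, 0, 0], sizes))   # intermediate case
```

The two extreme cases reproduce the limits $S = 1$ and $S = 0$ discussed in the text.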

As we did with the preceding indices, we need to calculate
the average of the entropy of a mock word type that
appears $n$ times in a random corpus. From Eq. (39) in
Appendix B we obtain

$$ (1 - S)_{\rm ran} = \frac{P - 1}{2 n \ln P} , \qquad (15) $$

for $n \gg 1$ and if all the parts of the random text have
the same number of tokens. Empirical evidence [14] shows
that the agreement of Eq. (15) with random shuffling of
texts using natural partitions is very good, in spite of the
limitation of the last assumption. From Eq. (15), we can
see that the dependence on the absolute frequency, $n$, is
critical for $(1 - S)_{\rm ran}$ and it cannot be ignored even if
the text is as large as a corpus.

Montemurro and Zanette [14] proposed Eqs. (13) and
(14) to study the distribution of words according to their
linguistic role. For this task, they found that the suitable
coordinates whereby words can be categorized are
$n (1 - S)$ and $n$. In the same way, we will use these ideas
for detecting relevance of words. We cannot use the entropy
directly as an index because all word types with only one
occurrence have zero entropy. Thus, we define a normalised
index freed from the dependence on absolute frequency
($n$) in random texts by

$$ E_{\rm nor}(w) = \frac{1 - S(w)}{(1 - S)_{\rm ran}} = \frac{2\, n(w) \ln P}{P - 1} \, (1 - S(w)) . \qquad (16) $$
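
As a numerical illustration, the following Python sketch evaluates the random-text expectation and the normalised index, assuming that Eq. (15) reads $(1-S)_{\rm ran} = (P-1)/(2 n \ln P)$ and that $E_{\rm nor}$ is the ratio of $1 - S$ to that value. The function names are hypothetical and the entropy $S$ is taken as an input.

```python
import math


def one_minus_s_ran(n, n_parts):
    """Expected value of 1 - S for a word with n occurrences in a random text."""
    return (n_parts - 1) / (2.0 * n * math.log(n_parts))


def e_nor(s, n, n_parts):
    """Normalised entropic index: (1 - S) in units of its random-text value."""
    return (1.0 - s) / one_minus_s_ran(n, n_parts)


# A word with 64 occurrences, all inside one of the 16 chapters (S = 0).
print(round(e_nor(0.0, 64, 16), 2))
# A word whose entropy equals the random expectation scores one.
print(e_nor(1.0 - one_minus_s_ran(64, 16), 64, 16))
```

Because the random expectation decays as $1/n$, the same entropy deficit $1 - S$ counts for more when the word is frequent, which is the point of the normalisation.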

Figure 6 shows the values of $E_{\rm nor}$ for all word types of
Darwin's corpus versus their number of occurrences, $n$, on a
double logarithmic scale. The individual deviations from
the bulk trend for each value of $n$ are related to the particular
usage nuances of words. To stress these deviations,
we have used the 16 chapters of the corpus as natural partitions
for our entropic analysis (i.e. $P = 16$). In this way,
we obtain a remarkable scattering of higher values of $E_{\rm nor}$
in the full range of number of occurrences. The same entropic
analysis using the 842 paragraphs of Darwin's corpus as
partitions (i.e. $P = 842$) generates a similar graph that
stresses the bulk trend, but the fluctuations are completely
smoothed. Using the chapters as partitions ($P = 16$) in
Fig. 6, the "cloud of points" corresponding to the random
version of the corpus is distributed around the unitary
value and the corpus appears clearly more separated from
the random text than with the previous indices. Additionally,
the words corresponding to the glossary have systematically
high values of the index $E_{\rm nor}$.
Fig. 6. $E_{\rm nor}$ vs. number of occurrences ($n$) for each word in
Darwin's corpus. We have also superimposed the random
version of the corpus (grey) and we have marked by open circles
the words corresponding to our prepared glossary (red online).

To reinforce our graphical findings, in the following
section we perform a quantitative comparison among the
indices $\sigma_{\rm nor}$, $\kappa_{\rm nor}$, $\Gamma_{\rm nor}$, and $E_{\rm nor}$ based on the power of
each index for discriminating the glossary from the bulk
of words.



Table 1. Recall and precision of each index. NG is the number
of the glossary's word types among the first 283 entries of each
ranking. LP is the last position in each ranking in which a
word type of the glossary appears. Thus, recall = NG/283 and
precision = 283/LP.

Index                 NG    recall   LP      precision   last word
$E_{\rm nor}$         118   0.417    2,790   0.101       FLOWERING
$\sigma_{\rm nor}$    114   0.403    5,689   0.050       SCARCELY
$\kappa_{\rm nor}$    107   0.378    4,181   0.068       INDIAN
$\Gamma_{\rm nor}$     72   0.254    4,312   0.066       OSTRICH


6 Glossary as benchmark 

Evaluation in information retrieval makes frequent use of
the notions of recall and precision [1,4]. Recall is defined as
the proportion of the target items that a system recovers.
Precision is defined as a measure of the proportion of selected
items that are targets. Recalling that our prepared
glossary has 283 word types, we denote by NG the
number of the glossary's word types among the top
283 ranked word types of the corpus. For our purposes,
we define the recall of an index of relevance as the fraction
NG/283. Thus, the recall for the index $E_{\rm nor}$ is 0.417 (about 42%). On
the other hand, precision can be built by looking for the last
word type of our prepared glossary in the global ranking
of each index. For our convenience, we denote by LP
the position in the ranking of the last word type of the
glossary, and we define the precision of a keyword extractor
as the fraction 283/LP. Thus, for example, the last entry
of the glossary according to the index $E_{\rm nor}$ is FLOWERING,
which is ranked in position 2,790. Recalling
that the corpus has 8,294 word types, we find that the
complete prepared glossary is placed by $E_{\rm nor}$ in the
first third of the ranked lexicon and the precision
of the index is about 10%. Recall and precision are useful
benchmarks for measuring the performance of an index. In
particular, the recall and precision of each index analysed in
this work are given in Table 1. We want to stress that the
values of recall and precision of the indices $\sigma$ and $\Gamma$ are
exactly the same as those obtained for $\sigma_{\rm nor}$ and $\Gamma_{\rm nor}$,
respectively. This fact is due to the normalisation factors
given by Eqs. (7) and (11), which are almost constant for
a corpus. Therefore, the pairs of indices $\sigma$ and $\sigma_{\rm nor}$ (or $\Gamma$
and $\Gamma_{\rm nor}$) yield identical rankings of keywords. In order to
compare the performance of all indices, in Fig. 7 we have
drawn a precision-recall plot where we can see the significant
improvement achieved by the index $E_{\rm nor}$, both in
recall and precision. Also, in Fig. 7 we see that $\kappa_{\rm nor}$ has
a recall slightly worse than $\sigma_{\rm nor}$ and a precision as good
as that of $\Gamma_{\rm nor}$. Thus, we find that the skewness of the distribution
of occurrences of a word type carries a significant part of the
information about the relevance of the word in the text.

Fig. 7. Comparison of information retrieval performance of
the indices (see Table 1).
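
The two benchmarks, as defined in this section, can be sketched in Python. The helper names and the five-word toy ranking are assumptions for illustration; the last lines only re-derive two entries of Table 1 from its NG and LP columns.

```python
def ng_and_lp(ranking, glossary):
    """NG: glossary words among the first len(glossary) entries of the ranking.
    LP: position (1-based) of the last glossary word in the ranking."""
    k = len(glossary)
    ng = sum(1 for word in ranking[:k] if word in glossary)
    lp = max(i for i, word in enumerate(ranking, start=1) if word in glossary)
    return ng, lp


def recall(ng, glossary_size=283):
    return ng / glossary_size


def precision(lp, glossary_size=283):
    return glossary_size / lp


# Toy ranking: three glossary words among five ranked word types.
toy_ng, toy_lp = ng_and_lp(["A", "X", "B", "Y", "C"], {"A", "B", "C"})
print(toy_ng, toy_lp, recall(toy_ng, 3), precision(toy_lp, 3))

# Values of NG and LP reported in Table 1 for the entropic index.
print(round(recall(118), 3), round(precision(2790), 3))
```

Note that precision, as defined here, depends only on the worst-ranked glossary word, so a single badly ranked term dominates it.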

In Table [2] we show the first top 50 word types of the 
prepared glossary ranked by the index E n0 r- We also show 
the rank position of each word type by the others indices. 
A false positive is when the system identifies a keyword 



8 



Juan P. Herrera, Pedro A. Pury: Statistical Keyword Detection in Literary Corpora 



Table 2. Top 50 word types of the prepared glossary ranked by the index E_nor. The numerical values are the positions of each word type in the ranking given by each index, not the actual values of the indices.

Word type      E_nor  Γ_nor  σ_nor    Word type      E_nor  Γ_nor  σ_nor
------------   -----  -----  -----    ------------   -----  -----  -----
HYBRIDS            1      2     13    SEA               33     65    309
STERILITY          3      1      7    SEEDS             35     64    279
SPECIES            5    447   1312    FERTILE           37     54    135
FORMS              6    185    667    ORGAN             39     14    218
VARIETIES          7     39    384    MOUNTAINS         40    120     94
INSTINCTS          8      3     19    GLACIAL           41     51    113
BREEDS             9     38    142    GARTNER           43     36     20
FERTILITY         10      8     33    HYBRID            44     46     59
FORMATIONS        11     20     78    CUCKOO            47     13      3
CROSSED           12      9     82    LAND              48    106    613
SELECTION         13    212    858    EGGS              50    109    215
ORGANS            14     61    433    STRUGGLE          51    829    571
NEST              16     22     18    BREED             52    332    367
INSTINCT          17      5     32    GEOLOGICAL        54    129    456
RUDIMENTARY       18     25    130    CROSS             62    125    205
FORMATION         19    144    341    HABITS            63    278   1260
BEES              21      6     29    STRUCTURE         65    105   1451
PLANTS            22    113    776    INHABITANTS       67     95    556
CELLS             23     18     50    FLOWERS           68     35    250
POLLEN            24     12     74    ANTS              75     41     35
NATURAL           25    460   1288    RACES             78    566    542
GROUPS            26     79    393    OFFSPRING         81    400    884
CROSSES           27     60     81    SEXUAL            85     89    285
WATER             29     75    400    VARIABLE          87    138    467
STERILE           31     19    109    WILD              89    235    269

Table 3. First 40 false-positive word types ranked by the index E_nor, with their numbers of occurrences n. The values in the E_nor column are the positions of each word type in the ranking, not the actual values of the index.

Word type      E_nor      n         Word type      E_nor      n
------------   -----  -----  -      ------------   -----  -----  -
I                  2    947         NORTHERN          60     41
ISLANDS            4    154  *      DESCENT           61     80  *
CHARACTERS        15    192  *      FRESH             64     50  *
GENERA            20    215  *      ITS               66    497
WAX               28     42         DIFFERENCES       69    168
ISLAND            30     69         CELL              70     30
DOMESTIC          32    131  *      EXTINCT           71    116  *
YOUNG             34    127         EUROPE            72     81  *
TEMPERATE         36     40         FERTILISED        73     34
SLAVES            38     34         DIAGRAM           74     40
NEW               42    278         SHALL             76    105
MY                45     99         WE                77   1320
INCREASE          46     82         DEVELOPED         79    146  *
INTERMEDIATE      49    164         BEDS              80     35
PERIOD            53    245  *      ADULT             82     46
MIVART            55     34  *      TWO               83    456
THROUGH           56    249         BETWEEN           84    367
HE                57    236         NUMBER            86    255
F                 58     37         OCEANIC           88     42  *
PARTS             59    230  *      THEORY            90    131

that really is not one. In Table 3 we show the top 40 word types, ranked by E_nor, that are not included in our prepared glossary. We can immediately see that several of these terms are not necessarily false positives. We have marked with an asterisk (*) in the table those word types that were not selected for the prepared glossary but that appear in the main entries of the original glossary of Darwin's book. Indeed, several more word types like these could also have been included in our prepared glossary. Moreover, one could argue that the word type I is relevant for a text written in the first person, like Darwin's book. ISLAND and SLAVES appear neither in the book's glossary nor in its index; however, E_nor adequately ranks them as keywords. The word type F is also meaningful to the text. It appears in the proper nouns "Mr. F. Smith" and "Dr. F. Muller", and in the collocations "F. sanguinea", "F. rufescens", "F. fusca", and "F. flava", which denote species. These observations lead us to consider that the performance of the index E_nor is better than what can be inferred from Table 1.

Moreover, the index E_nor requires less computational effort than the others. Once the number of occurrences of a word type is known, the algorithm for the variance or the skewness requires one accumulator plus a counter that tracks the number of tokens between nearest-neighbour occurrences of the word type, whereas the entropic index needs only one counter (of the number of occurrences) for each partition per word type. On the other hand, the algorithm for Γ requires three accumulators and, for each occurrence of a word type, we need to determine whether it corresponds to a cluster point.
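
The bookkeeping for the entropic index can be illustrated with a minimal Python sketch. It assumes that the text is already tokenised and that it is cut into P parts of equal length (any trailing tokens are dropped); the function name `entropy_by_parts` is hypothetical, and the entropy is normalised by ln P as in Appendix B. The normalisation against the random text that yields E_nor is not reproduced here.

```python
import math
from collections import defaultdict


def entropy_by_parts(tokens, n_parts):
    """Return {word_type: entropy} using one counter per part and word type.

    Hypothetical helper: the text is cut into n_parts equal parts and
    S = -(1 / ln P) * sum_i p_i ln p_i, with p_i = n_i / n.
    """
    if n_parts < 2:
        raise ValueError("n_parts must be at least 2")
    part_len = len(tokens) // n_parts
    if part_len == 0:
        raise ValueError("text too short for this number of parts")

    # One integer counter per (word type, part): a single pass over the text.
    counts = defaultdict(lambda: [0] * n_parts)
    for pos, tok in enumerate(tokens[: part_len * n_parts]):
        counts[tok][pos // part_len] += 1

    result = {}
    for word, per_part in counts.items():
        n = sum(per_part)
        s = -sum((c / n) * math.log(c / n) for c in per_part if c > 0)
        result[word] = s / math.log(n_parts)
    return result
```

A word type spread evenly over all parts gets an entropy of 1, while a word type confined to a single part gets 0.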



7 Concluding remarks

In summary, in this work we addressed the statistical distribution of words in texts. In particular, we concentrated on statistical methods for detecting keywords in literary texts. We reviewed two indices (σ and Γ) previously proposed [18,19] for measuring relevance, and we improved them by considering their values in random texts. Additionally, we introduced κ_nor, based on the skewness of the distribution of occurrences of a word, and we proposed another index for keyword detection based on the information entropy. Our proposals are very easy to implement numerically and perform as detectors as well as or better than the other indices. The ideas of this work can be applied to any natural language with clearly identified words, without requiring any previous knowledge of semantics or syntax.

A The Geometrical distribution

In this Appendix we briefly review the basic results on the geometrical distribution, scattered in the literature, that are useful for this work. First, we consider an experiment with only two possible outcomes for each trial (binomial experiment). Repeated independent trials of the binomial experiment are called Bernoulli trials if their probabilities remain constant throughout the trials. We denote by $p$ the probability of the "successful" outcome. Now, we are interested in the probability of success on the $j$-th trial after a given success. Given that the trials are independent, we immediately obtain the geometrical distribution

$$P(j) = (1-p)^{j-1}\, p , \quad \text{for } j \ge 1 . \tag{17}$$

A.1 Moments and cumulants

The characteristic function of a stochastic variable $X$ is defined by $G(k) = \langle e^{kX} \rangle = \sum_{j \ge 1} P(j) \exp(kj)$. Thus, for the geometrical distribution we obtain

$$G(k) = \frac{p\, e^{k}}{1 - (1-p)\, e^{k}} . \tag{18}$$

This function is also the moment generating function,

$$\langle X^n \rangle = \left. \frac{d^n G}{dk^n} \right|_{k=0} . \tag{19}$$

Therefore, the first three cumulants of the geometrical distribution are given by

$$\begin{aligned}
\mu_1 &= \langle X \rangle = \frac{1}{p} , \\
\mu_2 &= \langle X^2 \rangle - \langle X \rangle^2 = \frac{1-p}{p^2} , \\
\mu_3 &= \langle X^3 \rangle - 3 \langle X^2 \rangle \langle X \rangle + 2 \langle X \rangle^3 = \frac{(2-p)(1-p)}{p^3} .
\end{aligned} \tag{20}$$
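
As a quick numerical sanity check of the closed forms for the first three cumulants of the geometrical distribution, $1/p$, $(1-p)/p^2$ and $(2-p)(1-p)/p^3$, the following Python sketch sums the series directly. The function names and the truncation `j_max` are illustrative assumptions, not part of the original derivation.

```python
def geometric_cumulants(p, j_max=10_000):
    """First three cumulants of P(j) = (1 - p)**(j - 1) * p by direct summation.

    The series is truncated at j_max (an assumption of this sketch); the
    neglected tail is negligible unless p is very small.
    """
    m1 = m2 = m3 = 0.0
    for j in range(1, j_max + 1):
        w = (1.0 - p) ** (j - 1) * p
        m1 += j * w
        m2 += j * j * w
        m3 += j ** 3 * w
    return m1, m2 - m1 ** 2, m3 - 3.0 * m2 * m1 + 2.0 * m1 ** 3


def closed_form_cumulants(p):
    """Closed-form expressions for the same three cumulants."""
    return 1.0 / p, (1.0 - p) / p ** 2, (2.0 - p) * (1.0 - p) / p ** 3
```

For moderate values of $p$ both functions should agree to within rounding error.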



A.2 Addition of two geometrical variables

If $X_1$ and $X_2$ are geometrically distributed independent random variables, the distribution of the sum $Y = X_1 + X_2$ is

$$P_Y(j) = \sum_{m_1 + m_2 = j} P(m_1, m_2) , \quad \text{for } j = 2, 3, \dots , \tag{21}$$

where the joint probability distribution of the variables $X_1$ and $X_2$, $P(m_1, m_2)$, is given by

$$P(m_1, m_2) = p^2 (1-p)^{m_1 + m_2 - 2} , \quad \text{for } m_1 \ge 1 \text{ and } m_2 \ge 1 . \tag{22}$$

In this manner,

$$P_Y(j) = \sum_{m=1}^{j-1} P(m, j-m) = \sum_{m=1}^{j-1} p^2 (1-p)^{j-2} . \tag{23}$$

Therefore

$$P_Y(j) = (j-1)\, p^2 (1-p)^{j-2} , \quad \text{for } j = 2, 3, \dots . \tag{24}$$
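
The distribution of the sum can be checked by carrying out the convolution explicitly. The sketch below, with hypothetical function names, compares the brute-force convolution of two geometrical distributions with the closed form $(j-1)\,p^2 (1-p)^{j-2}$.

```python
def geometric_pmf(j, p):
    """P(j) = (1 - p)**(j - 1) * p for j >= 1, and 0 otherwise."""
    return (1.0 - p) ** (j - 1) * p if j >= 1 else 0.0


def sum_pmf_by_convolution(j, p):
    """P_Y(j) for Y = X1 + X2, summing over all m1 + m2 = j."""
    return sum(geometric_pmf(m, p) * geometric_pmf(j - m, p) for m in range(1, j))


def sum_pmf_closed_form(j, p):
    """Closed form (j - 1) * p**2 * (1 - p)**(j - 2) for j >= 2."""
    return (j - 1) * p ** 2 * (1.0 - p) ** (j - 2) if j >= 2 else 0.0
```

Every term of the convolution has the same value, which is why the sum collapses to a factor $(j-1)$.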

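Now, we are interested in the average of the random variable (recall Eq. (5))

$$\gamma = \begin{cases} 1 - \dfrac{Y}{2\mu} , & Y < 2\mu , \\ 0 , & Y \ge 2\mu , \end{cases} \tag{25}$$

where $Y$ is the sum of two independent geometrically distributed random variables with mean $\mu = 1/p$. By definition we have that

$$\langle \gamma \rangle = \sum_{n=2}^{h} \left( 1 - \frac{n}{2\mu} \right) P_Y(n) , \tag{26}$$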



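where $P_Y(n)$ is given by Eq. (24) and $h = \mathrm{Int}[2\mu]$. Defining $q = 1 - p$ and using the identity

$$\sum_{n=1}^{N} q^n = \frac{q - q^{N+1}}{1 - q} , \tag{27}$$

we immediately obtain

$$\sum_{n=2}^{h} P_Y(n) = p^2 \frac{d}{dq} \sum_{k=2}^{h} q^{k-1} = 1 - h\, q^{h-1} + (h-1)\, q^{h} \tag{28}$$

and

$$p \sum_{j=2}^{h} j\, P_Y(j) = p^3 \frac{d^2}{dq^2} \sum_{k=2}^{h} q^{k} = 2 - h (h+1)\, q^{h-1} + 2 (h+1)(h-1)\, q^{h} - h (h-1)\, q^{h+1} . \tag{29}$$

Therefore

$$\langle \gamma \rangle = \frac{1}{2}\, h (h-1)\, q^{h} \left( q + q^{-1} - 2 \right) . \tag{30}$$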

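The Poisson distribution can be obtained from the geometrical distribution in the limit $p \to 0$. Expanding the powers of $q = 1 - p$ in a Taylor series up to fourth order in $p$, we obtain

$$q^{h+1} + q^{h-1} - 2 q^{h} \approx p^2 + (1 - h)\, p^3 + \frac{1}{2} \left( 2 - 3h + h^2 \right) p^4 . \tag{31}$$

Given that for $p \to 0$ we have $h \gg 1$, the last equation can be recast as

$$q^{h+1} + q^{h-1} - 2 q^{h} \approx p^2 \left( 1 - h p + \frac{1}{2} (h p)^2 \right) \approx p^2 \exp(-h p) . \tag{32}$$

Finally, using that $h p \simeq 2$ in Eqs. (30) and (32), we obtain that the average of the random variable $\gamma$ for a Poisson distribution [19] is

$$\langle \gamma \rangle \simeq \frac{1}{2} (h p)^2 \exp(-h p) \simeq 2\, e^{-2} . \tag{33}$$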


B Entropy of a random text 

Here, we derive the entropy of a random text in more detail than in Ref. [14].

We consider a corpus of $N$ tokens as a composite of $P$ parts, with $N_i$ tokens in the $i$-th part ($i = 1, 2, \dots, P$). In a random corpus, the probability that a word type $w$ appears in part $j$ is $N_j/N$. Thus, the probability that $w$ appears $n_1$ times in part 1, $n_2$ times in part 2, and so on, is the multinomial distribution

$$p_w(n_1, n_2, \dots, n_P) = n! \prod_{i=1}^{P} \frac{1}{n_i!} \left( \frac{N_i}{N} \right)^{n_i} , \tag{34}$$

where $n = \sum_{i=1}^{P} n_i$ is the absolute frequency (number of tokens) of the word type $w$.

For simplicity, in this Appendix we consider the particular case in which all the parts have exactly the same number of tokens, i.e. $N_i = N/P$. Hence, the probability measure defined in the main text can be written simply as $p_i = n_i/n$, and the information entropy becomes

$$S = - \frac{1}{\ln P} \sum_{i=1}^{P} \frac{n_i}{n} \ln\left( \frac{n_i}{n} \right) . \tag{35}$$

Now, we are interested in the average value of the entropy over the distribution given by Eq. (34). We only need to compute the average of each term of Eq. (35) using the marginal distributions, $p_w(n_i)$, obtained from Eq. (34). All the marginal distributions are binomial, with mean $n/P$ and variance $(n/P)(1 - 1/P)$. Thus, we obtain for the average entropy

$$\langle S \rangle = - \frac{P}{\ln P} \sum_{m=1}^{n} \frac{m}{n} \ln\left( \frac{m}{n} \right) \binom{n}{m} \frac{1}{P^{m}} \left( 1 - \frac{1}{P} \right)^{n-m} . \tag{36}$$
For highly frequent word types, $n \gg 1$, we can approximate the binomial distribution by a Gaussian probability function $G(x; \mu, \sigma)$ with mean $\mu = 1/P$ and variance $\sigma^2 = (1/n)(P-1)/P^2$. Thus, Eq. (36) can be recast as

$$\langle S \rangle \approx - \frac{P}{\ln P} \int_0^1 x \ln x \; G(x; \mu, \sigma)\, dx . \tag{37}$$

In the limit $n \gg 1$, $\sigma \to 0$ and the Gaussian probability function concentrates around its mean value $\mu$. Using the expansion of the function $x \ln x$ around $\mu$,

$$x \ln x \approx \mu \ln \mu + (1 + \ln \mu)(x - \mu) + \frac{1}{2\mu} (x - \mu)^2 , \tag{38}$$

in Eq. (37) and remembering that

$$\int (x - \mu)^2\, G(x; \mu, \sigma)\, dx = \sigma^2 ,$$

we finally obtain for a random text [14] that

P- 1 